Intro

The goal of this analysis is an untargeted (data-driven) multi-omic analysis of metabolomics and 16S microbiome data generated from an ex vivo fecal incubation study. In this study, broccoli sprouts (broc), brussels sprouts (brus), and a combination of the two (combo), as well as a negative control (no_veg), underwent in vitro digestion. Following in vitro digestion, the chyme was incubated with human fecal samples (n = 40) for 24 hours, followed by 16S sequencing and untargeted metabolomics analysis (see our paper in Nutrients, https://www.mdpi.com/2072-6643/13/9/3013, for more methodological details).

Our previous work with this dataset identified two separate microbiome profiles among our subjects: some subjects had microbiomes enriched with bacteria from family Clostridiaceae (C_Type individuals in this analysis), while others had microbiomes enriched with bacteria from family Enterobacteriaceae (E_Type individuals). Individuals with microbiomes rich in members of family Clostridiaceae were found to have a higher abundance of sulforaphane-nitrile in their digesta compared to individuals with microbiomes rich in Enterobacteriaceae. The composition of the microbiomes and a PCoA plot are shown below.

setwd('~/Documents/Projects/poopsoup/poopsoup_two/')
library(tidyverse)
library(magrittr)
library(ggfortify)
library(lme4)
library(lmerTest)
library(factoextra)
library(phyloseq)
library(cowplot)
library(plotly)
library(remMap)
library(RGCCA)

#Helper functions: 
log_helper <- function(x, min.val){
  log2((x + sqrt(x^2 + min.val^2))/2)
}
#Pareto Scaling:
PS_helper <- function(x){
  (x - mean(x, na.rm = TRUE))/sqrt(sd(x, na.rm = TRUE))
}   

#Transformation Functions:
#Log Scaling:
log_transform <- function(mtb){
  mtb_nz <- mtb[ ,which(apply(mtb, 2, sum) != 0)]
  min.val <- min(abs(mtb_nz[mtb_nz!=0]))/10
  mtb_log_trans <- apply(mtb_nz, 2, log_helper, min.val)
  return(mtb_log_trans)
}

#Pareto Scaling:
pareto_scale <- function(mtb){
  mtb_scaled <- apply(mtb, 2, PS_helper) 
  return(mtb_scaled)
}
asvtab <- readRDS('../data/microbiome/asv_tab.RDS')
taxatab <- readRDS('../data/microbiome/tax_tab.RDS')
metadata <- read.csv('../data/microbiome/microbiome_metadata.csv')

metadata %<>% mutate(group = ifelse(group == 'A', 'C_Type', 'E_Type'))

#Factorize and set the levels on the metadata
metadata$treatment %<>% factor(levels = c('fecal_stock', 'no_veg', 'broc', 'brus', 'combo', 'control_digest'))
metadata$fecal_sample %<>% factor(levels = c('T5631','T5632','T6260','T6291','T4669','T1995','T5627','T5717','T5854','T6382')) 

rownames(metadata) <- metadata$sample
rownames(asvtab) <- metadata$sample
ps_raw <- phyloseq(otu_table(asvtab, taxa_are_rows = FALSE),
               sample_data(metadata),
               tax_table(taxatab))

#Give arbitrary names to the taxa rather than keeping the identifying DNA sequences as names
taxa_names(ps_raw) <- paste0("ASV", seq(ntaxa(ps_raw)))

#Fill in missing genus names:
renames <- rownames(tax_table(ps_raw)[is.na(tax_table(ps_raw)[, 'Genus'])])
taxdf <- tax_table(ps_raw)[renames,]
renamed_genus <- unname(sapply(taxa_names(taxdf), function(x) paste0('f_', taxdf[x, 'Family'], '_', x)))
tax_table(ps_raw)[renames, 'Genus'] <- renamed_genus

#Remove the control digests, these are not relevant to our analysis
ps_raw <- ps_raw %>% subset_samples(treatment != 'control_digest')
#Agglomerate to the genus level
ps_genera <- ps_raw %>% tax_glom(taxrank = "Genus")
#Remove taxa not seen more than 3 times in at least 20% of the samples
ps_counts <- ps_genera %>% filter_taxa(function(x) sum(x > 3) > (0.2*length(x)), TRUE)
#Convert from counts to relative abundance
ps_relab <- ps_counts %>% transform_sample_counts(function(x) x / sum(x) )
#Filter out low-abundance taxa (keep taxa with mean relative abundance > 1e-5)
ps <- ps_relab %>% filter_taxa(function(x) mean(x) > 1e-5, TRUE)

ps_final <- ps %>% subset_samples(treatment != 'fecal_stock')

microdata <- cbind(sample = data.frame(sample_data(ps_final))$metabolomics_neg_sample, data.frame(otu_table(ps_final))) %>%
  mutate(sample = gsub('neg', 'ms', sample))
all_alpha <- plot_bar(ps_final, fill = 'Family', x = 'treatment') + 
  facet_wrap(~fecal_sample, ncol = 5) + 
  theme_cowplot() + theme(axis.text.x = element_text(angle = 45, hjust = 0.5, size = 12), 
                          axis.text.y = element_text(size = 12),
                          axis.title.x = element_blank(),
                          axis.title.y = element_blank()) +
  ggtitle('Relative Abundance of Bacterial Taxa Following Incubation')
xaxis <- list(title = 'Treatment', automargin = TRUE)

ggplotly(all_alpha, width = 900, height = 600) %>% layout(autosize = F, xaxis = xaxis) 
ordBCall <- ordinate(ps_final, method = 'PCoA', distance = 'bray')
plot_ordination(ps_final, ordBCall, color = 'treatment') +
  geom_point(aes(text = paste0('Fecal Sample: ', fecal_sample, '\n',
                               'Treatment: ', treatment, '\n',
                               'Group: ', group)),
             size = 3) +
  stat_ellipse(aes(group = group, fill = group), alpha = 0.2, geom = 'polygon') +
  theme_minimal_grid() +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  ggtitle('Beta Diversity of All Samples') 

The goal of this analysis is to identify microbial taxa and metabolomic features associated with cruciferous vegetable consumption. Additionally, we aim to identify more metabolites associated with the two microbiome profiles identified previously.

Data Generation Methods

Microbiome

DNA was isolated from fecal cultures (post-incubation) using a QIAamp PowerFecal DNA kit per the manufacturer’s protocol. The fecal DNA concentration was measured using a Qubit dsDNA HS assay kit. PCR was used to amplify the V4-V5 region of the 16S rRNA gene, and the resulting library was sequenced on an Illumina MiSeq following the Earth Microbiome Project protocol. This approach yielded 300 bp, paired-end amplicon sequences at a target sequencing depth of 50,000 reads per sample. The 16S amplicon sequencing was performed at the Center for Quantitative Life Sciences core facilities using established methods. Data preprocessing and identification of amplicon sequence variants (ASVs) were conducted using the DADA2 pipeline, as implemented in R (v3.5). Briefly, reads were first trimmed for quality and filtered at a maximum of two expected errors, followed by merging of paired reads and removal of chimeric ASVs. Taxonomy was assigned using the Silva database v132 with the naïve Bayesian classifier built into DADA2. Using phyloseq, the ASVs were first agglomerated at the genus level, reducing the number of ASVs from 3557 to 935 due to the high number of unannotated ASVs. To remove noise from the dataset, highly rare genera (not seen more than three times in at least 20% of the samples, or with a mean relative abundance below 0.001%) were filtered out, resulting in a final dataset of 72 ASVs. Rarefaction curves, built with the vegan package on the agglomerated and filtered data, were used to ensure all samples were sufficiently sequenced.

Metabolomics

Metabolites from fecal culture medium were extracted (100 μL culture/100 μL ice cold 80:20 methanol:water), mixed vigorously, and clarified by centrifugation (13,000× g for 10 min). The supernatants were further diluted 1:10 with ice cold 80:20 methanol:water (v/v) and transferred to mass spectrometry (MS) vials. SFN-NIT and IBN-NIT were detected using LC-MS/MS in positive ion mode, as previously described. Briefly, HPLC was performed on a Shimadzu Nexera system with a phenyl-3 stationary phase column (Inertsil Phenyl-3, 5 µm, 4.6 × 150 mm) coupled to a quadrupole time-of-flight MS (AB SCIEX TripleTOF 5600). The samples were randomized, auto-calibration was performed every two samples, and a quality control sample, composed of a pooled aliquot from each sample, was analyzed every 10 samples. MS/MS information was obtained for all samples using information dependent acquisition (IDA), while sequential window acquisition of all theoretical spectra (SWATH) was performed only on quality control samples. Spectral data were processed using Progenesis QI (NonLinear Dynamics v2.4). Peak deconvolution for [M + H]+, [M + Na]+, and [M + NH4]+ adducts in positive ionization mode, and [M−H]-, [M + FA-H]-, and [M−H2O−H]- in negative ionization mode was performed in Progenesis QI. Additionally, feature intensities were normalized in Progenesis QI across samples by total ion current of all features. Metabolomic features with a CV greater than 50 in QC samples were removed from analysis due to high technical variation. The resulting data matrix was exported as a CSV for downstream analysis in R.
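As a sanity check of the QC-based CV filter described above, the computation can be sketched in a few lines of base R (toy intensities; in practice the filter was applied via the `cv_g_50` flag in the Progenesis export):

```r
# Toy QC matrix: 3 QC injections (rows) x 2 features (columns).
# Hypothetical values, for illustration only.
qc <- matrix(c(100, 110, 95,   # ft1: stable across QC injections
               10,  40,  90),  # ft2: highly variable
             nrow = 3, dimnames = list(NULL, c('ft1', 'ft2')))
cv <- apply(qc, 2, function(x) 100 * sd(x) / mean(x)) # CV as a percentage
keep <- names(cv)[cv <= 50]                           # drop features with CV > 50%
keep
## [1] "ft1"
```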

Data Load In:

#Full datasets from Progenesis:
posugly <- read_csv('./data/raw_progenesis/R_Friendly/pos_metabolome_full.csv')
negugly <- read_csv('./data/raw_progenesis/R_Friendly/neg_metabolome_full.csv')
#Metadata
mdata_neg <- read_csv('~/Documents/Projects//poopsoup/data/metabolomics/neg/metadata_neg.csv') %>%
  mutate(group = ifelse(group == 'A', 'C_Type', 'E_Type'))
mdata_pos <- read_csv('~/Documents/Projects/poopsoup/data/metabolomics/pos/metadata_pos.csv') %>%
  mutate(group = ifelse(group == 'A', 'C_Type', 'E_Type'))

#Remove features which have a CV greater than 50:
pos_full <- posugly %>%
  filter(is.na(cv_g_50)) %>%
  select(compound, starts_with('pos')) %>%
  mutate(compound = paste0(compound, '_pos')) %>%
  pivot_longer(starts_with('pos'), names_to = 'sample', values_to = 'intensity') %>%
  pivot_wider(names_from = 'compound', values_from = 'intensity')

neg_full <- negugly %>%
  filter(is.na(cv_g_50)) %>%
  select(compound, starts_with('neg')) %>%
  mutate(compound = paste0(compound, '_neg')) %>%
  pivot_longer(starts_with('neg'), names_to = 'sample', values_to = 'intensity') %>%
  pivot_wider(names_from = 'compound', values_from = 'intensity')

We will start by transforming our data prior to analysis. First we will log-transform our data, followed by Pareto scaling. Pareto scaling reduces the relative importance of large values while keeping the data structure partially intact. Since Pareto scaling is sensitive to large fold changes in the data, a log2 transformation is applied first. To handle zeros in the data, the following generalized log transformation is applied: glog2(x) = log2((x + sqrt(x^2 + a^2))/2), where a is one tenth of the minimum non-zero absolute value in the data.
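A quick base-R sanity check of this transform (same form as the `log_helper` defined above, with a hypothetical offset value): it stays finite at zero and converges to log2(x) once x is much larger than a.

```r
glog2 <- function(x, a) log2((x + sqrt(x^2 + a^2)) / 2) # same form as log_helper
a <- 0.1                          # stand-in for (minimum non-zero absolute value)/10
glog2(0, a)                       # finite (= log2(a/2)), so zeros are handled
abs(glog2(1000, a) - log2(1000))  # negligible: glog2 converges to log2 for large x
```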

#Scale our data:
pos_scaled <- pos_full %>% 
  column_to_rownames('sample') %>%
  log_transform() %>%
  pareto_scale() %>%
  as.data.frame() 


neg_scaled <- neg_full %>% 
  column_to_rownames('sample') %>%
  log_transform() %>%
  pareto_scale() %>%
  as.data.frame() 

#Generate the metadata to be in the same order as our metabolomics data
mpos <- left_join(data.frame(sample = rownames(pos_scaled)), mdata_pos)
mneg <- left_join(data.frame(sample = rownames(neg_scaled)), mdata_neg)

First we will look at the structure of our metabolomics data using a PCA analysis.

Positive Mode

pca_pos <- prcomp(pos_scaled, center = TRUE)
autoplot(pca_pos, data = mpos, colour = 'treatment', shape = 'group', size = 3) + ggtitle('Positive Ionization Mode') +
  stat_ellipse(aes(group = treatment, color = treatment, fill = treatment), geom = 'polygon', alpha = 0.2)  +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  cowplot::theme_minimal_grid()

Negative Mode

pca_neg <- prcomp(neg_scaled, center = TRUE)
autoplot(pca_neg, data = mneg, colour = 'treatment', shape = 'group', size = 3) + ggtitle('Negative Ionization Mode')  +
  stat_ellipse(aes(group = treatment, color = treatment, fill = treatment), geom = 'polygon', alpha = 0.2) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  cowplot::theme_minimal_grid()

Feature Filtering

Unsurprisingly, no_veg, the negative control, appears to be the most distinct group in our dataset, but the other groups don’t seem to separate much. Our dataset contains 11,584 features in negative ionization mode and 16,420 features in positive ionization mode (positive mode is expected to contain more features due to higher signal intensity, i.e., technical differences). The high dimensionality of our dataset may pose a problem for later analyses. Additionally, some features are redundant across our treatment groups, for example bile acids (a component of the in vitro digestion), which are present in the control digest and all other digests. To reduce the dimensionality of our dataset and remove noise, we only want to consider features of interest for our downstream multi-omic integration. Specifically, we want to highlight two subsets of features: those which are differentially abundant between the treatment groups, and those which are differentially abundant between our C_Type and E_Type individuals.

Repeated Measures ANOVA

pos_prog <- read_csv('~/Documents/Projects/poopsoup/poopsoup_two/data/raw_progenesis/R_Friendly/pos_metabolome_full.csv')
neg_prog <- read_csv('~/Documents/Projects/poopsoup/poopsoup_two/data/raw_progenesis/R_Friendly/neg_metabolome_full.csv')

prog_sig_pos <- pos_prog %>%
  filter(!is.na(q_u_05)) %>%
  dplyr::select(compound, starts_with('pos')) %>%
  mutate(compound = paste0(compound, '_pos')) %>%
  pivot_longer(starts_with('pos'), names_to = 'sample', values_to = 'intensity') %>%
  pivot_wider(names_from = 'compound', values_from = 'intensity')

prog_sig_neg <- neg_prog %>%
  filter(!is.na(q_u_05)) %>%
  dplyr::select(compound, starts_with('neg')) %>%
  mutate(compound = paste0(compound, '_neg')) %>%
  pivot_longer(starts_with('neg'), names_to = 'sample', values_to = 'intensity') %>%
  pivot_wider(names_from = 'compound', values_from = 'intensity')

rmanova_sig_neg <- colnames(prog_sig_neg)[-1]
rmanova_sig_pos <- colnames(prog_sig_pos)[-1]

To determine the impact of treatment (no_veg, broc, brus, combo) on the metabolomic profile of the fecal cultures, a repeated measures ANOVA (RMANOVA) was conducted using the Progenesis QI software. The hyperbolic arcsine (arcsinh) transformation was applied before running the RMANOVA. The false discovery rate (FDR) was controlled at 5% (significance was determined as q < 0.05). In positive ionization mode, 3766 features were found to be significantly different between treatment groups, and in negative ionization mode, 2152 features were found to be significantly different between treatment groups.
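For reference, the arcsinh transform behaves like a logarithm for large intensities but, unlike a plain log, is defined at zero, so no offset is needed (a base-R sketch):

```r
# asinh(x) = log(x + sqrt(x^2 + 1)): ~log(2x) for large x, exactly 0 at x = 0
asinh(0)                    # 0, so zero intensities need no offset
abs(asinh(1e6) - log(2e6))  # essentially 0 for large intensities
```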

Generalized Linear Mixed Model

Next we want to find features significantly different between the two microbiome profiles observed in our fecal cultures. To do so, we will use a generalized linear mixed model (GLMM). The no_veg group has unequal variance relative to the other three treatment groups, so instead of fitting a single model with all four groups, one model containing the three vegetable groups (broc, brus, combo) will be fit, and a separate control model containing only no_veg will be fit and compared against it. lmerTest will be used to construct contrasts to test our model, which will be built using the formula: intensity ~ treatment + group + treatment*group + (1|fecal_sample)
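To make the contrast vectors used below concrete, here is how base R orders the fixed-effect columns for this formula (a toy design with broc and C_Type as the reference levels; coefficient names will mirror the actual group labels in the data):

```r
d <- expand.grid(treatment = factor(c('broc', 'brus', 'combo')),
                 group     = factor(c('C_Type', 'E_Type')))
colnames(model.matrix(~ treatment * group, data = d))
## [1] "(Intercept)"                 "treatmentbrus"              
## [3] "treatmentcombo"              "groupE_Type"                
## [5] "treatmentbrus:groupE_Type"   "treatmentcombo:groupE_Type" 
# c(0,0,0,1,0,0) -> group effect within broc (the reference level)
# c(0,0,0,1,1,0) -> group effect within brus  (group + brus:group)
# c(0,0,0,1,0,1) -> group effect within combo (group + combo:group)
```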

library(lmerTest)
lmer_data_pos <- mdata_pos %>%
  select(sample, treatment, fecal_sample, group) %>%
  right_join(pos_scaled %>% rownames_to_column('sample')) %>% 
  filter(treatment != 'no_veg') %>% #Remove from analysis as it has different variance from the rest of the samples
  column_to_rownames('sample') %>%
  select(where(is.numeric)) %>%
  select(which(apply(., 2, sd) != 0)) %>% #Remove features with 0 variance
  rownames_to_column('sample') %>%
  left_join(mdata_pos %>% select(sample, treatment, fecal_sample, group)) %>%
  pivot_longer(cols = where(is.numeric), names_to = 'feature', values_to = 'intensity')

lmer_pos <- lmer_data_pos %>% 
  group_by(feature) %>%
  nest() %>% # nests each feature's rows into a 'data' list-column
  mutate_at( # transform the 'data' column in place
    "data",
    purrr::map, # iterate over each feature's data; returns a list
    function(x) {
      lmer <- lmerTest::lmer(intensity ~ treatment + group + treatment * group + (1 |fecal_sample), data = x) # fit the mixed model
      c(lmer, x) # return the fitted model along with the data for the contrasts below
    } 
  )

broc_pos <- lmer_pos %>%
  group_by(feature) %>%
  group_modify(~ {
    contest(.x$data[[1]][[1]], c(0,0,0,1,0,0), joint = FALSE)
  }) %>%
  set_colnames(c('feature', 'estimate', 'stderror', 'df', 'tvalue', 'lower', 'upper', 'pvalue')) %>%
  inset('term', value = 'broc') %>%
  ungroup()

brus_pos <- lmer_pos %>%
  group_by(feature) %>%
  group_modify(~ {
    contest(.x$data[[1]][[1]], c(0,0,0,1,1,0), joint = FALSE)
  }) %>%
  set_colnames(c('feature', 'estimate', 'stderror', 'df', 'tvalue', 'lower', 'upper', 'pvalue')) %>%
  inset('term', value = 'brus') %>%
  ungroup()

combo_pos <- lmer_pos %>%
  group_by(feature) %>%
  group_modify(~ {
    contest(.x$data[[1]][[1]], c(0,0,0,1,0,1), joint = FALSE)
  }) %>%
  set_colnames(c('feature', 'estimate', 'stderror', 'df', 'tvalue', 'lower', 'upper', 'pvalue')) %>%
  inset('term', value = 'combo') %>%
  ungroup()

control_data_pos <- mdata_pos %>%
  select(sample, treatment, fecal_sample, group) %>%
  right_join(pos_scaled %>% rownames_to_column('sample')) %>% 
  filter(treatment == 'no_veg') %>% #The control model is fit on the no_veg samples alone
  column_to_rownames('sample') %>%
  select(where(is.numeric)) %>%
  select(which(apply(., 2, sd) != 0)) %>% #Remove features with 0 variance
  rownames_to_column('sample') %>%
  left_join(mdata_pos %>% select(sample, treatment, fecal_sample, group)) %>%
  pivot_longer(cols = where(is.numeric), names_to = 'feature', values_to = 'intensity')

control_model_pos <- control_data_pos %>% 
  group_by(feature) %>%
  nest() %>% # nests each feature's rows into a 'data' list-column
  mutate_at( # transform the 'data' column in place
    "data",
    purrr::map, # iterate over each feature's data; returns a list
    function(x) {
      lmr <- lm(intensity ~ group, data = x) %>% # fit a simple linear model of group
        summary() %>%
        broom::tidy()
      lmr[2,] # keep only the group coefficient row
    } 
  ) %>%
  unnest('data') %>%
  rename('pval' = p.value) %>%
  mutate(padj = p.adjust(pval, method = 'BH')) %>%
  select(feature, pval, padj)

sig_control_pos <- control_model_pos %>%
  filter(padj <= 0.05) %>%
  pull(feature)


lmer_data_neg <- mdata_neg %>%
  select(sample, treatment, fecal_sample, group) %>%
  right_join(neg_scaled %>% rownames_to_column('sample')) %>% 
  filter(treatment != 'no_veg') %>% #Remove from analysis as it has different variance from the rest of the samples
  column_to_rownames('sample') %>%
  select(where(is.numeric)) %>%
  select(which(apply(., 2, sd) != 0)) %>% #Remove features with 0 variance
  rownames_to_column('sample') %>%
  left_join(mdata_neg %>% select(sample, treatment, fecal_sample, group)) %>%
  pivot_longer(cols = where(is.numeric), names_to = 'feature', values_to = 'intensity')

lmer_neg <- lmer_data_neg %>% 
  group_by(feature) %>%
  nest() %>% # nests each feature's rows into a 'data' list-column
  mutate_at( # transform the 'data' column in place
    "data",
    purrr::map, # iterate over each feature's data; returns a list
    function(x) {
      lmer <- lmerTest::lmer(intensity ~ treatment + group + treatment * group + (1 |fecal_sample), data = x) # fit the mixed model
      c(lmer, x) # return the fitted model along with the data for the contrasts below
    } 
  )

broc_neg <- lmer_neg %>%
  group_by(feature) %>%
  group_modify(~ {
    contest(.x$data[[1]][[1]], c(0,0,0,1,0,0), joint = FALSE)
  }) %>%
  set_colnames(c('feature', 'estimate', 'stderror', 'df', 'tvalue', 'lower', 'upper', 'pvalue')) %>%
  inset('term', value = 'broc') %>%
  ungroup()

brus_neg <- lmer_neg %>%
  group_by(feature) %>%
  group_modify(~ {
    contest(.x$data[[1]][[1]], c(0,0,0,1,1,0), joint = FALSE)
  }) %>%
  set_colnames(c('feature', 'estimate', 'stderror', 'df', 'tvalue', 'lower', 'upper', 'pvalue')) %>%
  inset('term', value = 'brus') %>%
  ungroup()

combo_neg <- lmer_neg %>%
  group_by(feature) %>%
  group_modify(~ {
    contest(.x$data[[1]][[1]], c(0,0,0,1,0,1), joint = FALSE)
  }) %>%
  set_colnames(c('feature', 'estimate', 'stderror', 'df', 'tvalue', 'lower', 'upper', 'pvalue')) %>%
  inset('term', value = 'combo') %>%
  ungroup()

control_data_neg <- mdata_neg %>%
  select(sample, treatment, fecal_sample, group) %>%
  right_join(neg_scaled %>% rownames_to_column('sample')) %>% 
  filter(treatment == 'no_veg') %>% #The control model is fit on the no_veg samples alone
  column_to_rownames('sample') %>%
  select(where(is.numeric)) %>%
  select(which(apply(., 2, sd) != 0)) %>% #Remove features with 0 variance
  rownames_to_column('sample') %>%
  left_join(mdata_neg %>% select(sample, treatment, fecal_sample, group)) %>%
  pivot_longer(cols = where(is.numeric), names_to = 'feature', values_to = 'intensity')

control_model_neg <- control_data_neg %>% 
  group_by(feature) %>%
  nest() %>% # nests each feature's rows into a 'data' list-column
  mutate_at( # transform the 'data' column in place
    "data",
    purrr::map, # iterate over each feature's data; returns a list
    function(x) {
      lmr <- lm(intensity ~ group, data = x) %>% # fit a simple linear model of group
        summary() %>%
        broom::tidy()
      lmr[2,] # keep only the group coefficient row
    } 
  ) %>%
  unnest('data') %>%
  rename('pval' = p.value) %>%
  mutate(padj = p.adjust(pval, method = 'BH')) %>%
  select(feature, pval, padj)

sig_control_neg <- control_model_neg %>%
  filter(padj <= 0.05) %>%
  pull(feature)

unique_neg <- rbind(broc_neg, brus_neg, combo_neg) %>%
  mutate(padj = p.adjust(pvalue, method = 'BH')) %>%
  dplyr::filter(padj <= 0.05) %>%
  filter(!feature %in% sig_control_neg)

unique_pos <- rbind(broc_pos, brus_pos, combo_pos) %>%
  mutate(padj = p.adjust(pvalue, method = 'BH')) %>%
  dplyr::filter(padj <= 0.05) %>%
  filter(!feature %in% sig_control_pos)

final_neg <- rbind(broc_neg, brus_neg, combo_neg) %>%
  mutate(padj = p.adjust(pvalue, method = 'BH')) %>%
  dplyr::filter(padj <= 0.05) 

final_pos <- rbind(broc_pos, brus_pos, combo_pos) %>%
  mutate(padj = p.adjust(pvalue, method = 'BH')) %>%
  dplyr::filter(padj <= 0.05) 

glmm_sig_neg <- unique(final_neg$feature)
glmm_sig_pos <- unique(final_pos$feature)


#save(glmm_sig_neg, glmm_sig_pos, file = 'glmm_out.RData')

After comparing the treatment model to the control model, we find 2 features in positive ion mode and 3 features in negative ion mode that are significantly different in abundance between the two microbiome profiles in our individuals. Due to the low number of these findings, we will also include features found to be significantly different in the control model, bringing our totals to 116 features in positive ion mode and 136 features in negative ion mode.

neg_ft <- unique(c(glmm_sig_neg, rmanova_sig_neg)) #2281 Features
pos_ft <- unique(c(glmm_sig_pos, rmanova_sig_pos)) #3871 Features

reduced_set_neg <- neg_scaled[ ,colnames(neg_scaled) %in% neg_ft]
reduced_set_pos <- pos_scaled[ ,colnames(pos_scaled) %in% pos_ft]

Lastly, we combine the findings of our RMANOVA and GLMM to produce the final feature list that will be used for the multi-omic analyses. In negative ion mode, 6 features overlap between the GLMM and RMANOVA results, while in positive ion mode 10 features overlap between the two methods. Our final datasets contain 2281 features and 3871 features for negative and positive ion mode, respectively.

Before moving forward to our multi-omic analyses, let’s construct our PCAs again to see how reducing our features has impacted the results.

Positive Mode

pca_pos_re <- prcomp(reduced_set_pos, center = TRUE)
autoplot(pca_pos_re, data = mpos, colour = 'treatment', shape = 'group', size = 3) + 
  ggtitle('Positive Ionization Mode - Filtered') +
  stat_ellipse(aes(group = treatment, color = treatment, fill = treatment), geom = 'polygon', alpha = 0.2) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  cowplot::theme_minimal_grid()

Negative Mode

pca_neg_re <- prcomp(reduced_set_neg, center = TRUE)
autoplot(pca_neg_re, data = mneg, colour = 'treatment', shape = 'group', size = 3) + 
  ggtitle('Negative Ionization Mode - Filtered') +
  stat_ellipse(aes(group = treatment, color = treatment, fill = treatment), geom = 'polygon', alpha = 0.2) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  cowplot::theme_minimal_grid()

Combined Modes

combined_set <- cbind(reduced_set_neg, reduced_set_pos)
rownames(combined_set) <- gsub('neg', 'ms', rownames(combined_set))
mcom <- mneg %>%
  mutate(sample = gsub('neg', 'ms', sample)) 

pca_com_re <- prcomp(combined_set, center = TRUE)
autoplot(pca_com_re, data = mcom, colour = 'treatment', shape = 'group', size = 3) + ggtitle('Combined Filtered Set') +
  stat_ellipse(aes(group = treatment, color = treatment, fill = treatment), geom = 'polygon', alpha = 0.2) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  cowplot::theme_minimal_grid()

Unsurprisingly, on PC1 we see no_veg pulling away from the broccoli groups, and a relatively large share of the total variation (~35% vs ~15% in the full dataset) is now attributed to a treatment effect. On PC2, a small amount of separation appears to occur by group, but it is not convincing.

Lastly, we have combined our positive and negative ion modes to create a single dataset that is reflective of the entire metabolome. Work is ongoing to use a tool called MSCombine to help remove redundancies between these sets.

Question 1

Is reducing the number of features for future analyses acceptable, or should I use the entire dataset? I am worried that due to the extremely high dimensionality of the combined, unfiltered dataset (~28,000 features), there will be too much noise to find any prominent signals. Most multi-omic papers that I have seen (and statistical methods papers) seem to narrow down their features to a select set, typically based on biological information. Since the identity of many of our metabolomic features is unknown and I am trying to take a data-driven approach to this analysis, this is difficult; thus, I am trying to use statistical methods to select metabolomic features of interest that may play a role in metabolism.

Marginal Correlation Analysis

The first multi-omic technique we will apply is a simple marginal correlation analysis. For this, we will run a Spearman correlation between each ASV and each metabolomic feature. No grouping or treatment information will be used.
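As a minimal base-R illustration of the pairwise testing and multiplicity correction applied below (toy vectors, not the real data):

```r
set.seed(1)
relab     <- runif(20)                   # toy ASV relative abundances
intensity <- relab + rnorm(20, sd = 0.2) # toy correlated feature intensities
st <- cor.test(relab, intensity, method = 'spearman')
st$estimate   # rho
st$p.value    # unadjusted p-value
# Across all feature-ASV pairs, the p-values are then BH-adjusted:
p.adjust(c(0.001, 0.01, 0.04), method = 'BH')
## [1] 0.003 0.015 0.040
```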

multiset_wide <-  combined_set %>%
  rename_with(~paste0('FT_', .x)) %>%
  rownames_to_column('sample') %>%
  left_join(microdata)

multiset_tidy <- multiset_wide %>%
  pivot_longer(cols = starts_with('FT'), values_to = 'intensity', names_to = 'feature') %>%
  pivot_longer(cols = starts_with('ASV'), values_to = 'relab', names_to = 'ASV') %>%
  left_join(mcom)
spm_out <- multiset_tidy %>%
  group_by(feature, ASV) %>%
  nest() %>%
  mutate(spm = map(data, function(x) cor.test(x$relab, x$intensity, method = 'spearman'))) 

spearman <- spm_out %>%
  ungroup() %>%
  mutate(rho = map_dbl(spm, function(x) x$estimate)) %>%
  mutate(pval = map_dbl(spm, function(x) x$p.value)) %>%
  mutate(padj = p.adjust(pval, method = 'BH'))

spm_ft <- spearman %>% 
  filter(padj <= 0.05) %>%
  pull(feature) %>%
  unique()

spm_asv <- spearman %>% 
  filter(padj <= 0.05) %>%
  pull(ASV) %>%
  unique()

Our Spearman test evaluated each feature-ASV pair, resulting in 442,944 tests. Of these, 11,440 came back as significant (2.6%). 4335 metabolomic features were found to have a significant correlation with at least 1 ASV, accounting for 70.5% of the metabolome. Similarly, 70 ASVs were found to have a significant correlation with at least 1 metabolomic feature, accounting for 97% of the ASVs.

Sparse Generalized Canonical Correlation Analysis

Next we will utilize dimension-reduction techniques to analyze our data. Specifically, we will use Sparse Generalized Canonical Correlation Analysis (SGCCA). The sparsity parameter for each of our models is the optimal shrinkage parameter (calculated by the package using the Schäfer-Strimmer formula) from the corresponding RGCCA (regularized generalized canonical correlation analysis) model. For each block, 5 components will be built, and the components which offer the best separation of both treatment and microbiome profile will be selected.

ft_block <- multiset_wide %>%
  dplyr::select(starts_with('FT'))

asv_block <- multiset_wide %>%
  dplyr::select(starts_with('ASV'))

#Recreate the metadata to make sure that they are in the same order as the features
multiset_meta <- multiset_wide %>%
  dplyr::select('sample') %>%
  left_join(mcom) 

group_block <- multiset_meta %>%
  dplyr::select(sample, group) %>%
  mutate(group = ifelse(group == 'C_Type', 1, 0)) %>%
  dplyr::select(group)

treatment_block <- multiset_meta %>%
  dplyr::select(sample, treatment) %>%
  mutate(trt_broc = ifelse(treatment == 'broc', 1, 0)) %>%
  mutate(trt_brus = ifelse(treatment == 'brus', 1, 0)) %>%
  mutate(trt_combo = ifelse(treatment == 'combo', 1, 0)) %>%
  dplyr::select(starts_with('trt'))

Model 1

Our first SGCCA analysis is akin to a multi-block PCA and will be used to identify metabolomic features and ASVs associated with the different treatment groups and the microbial profiles we observed in our individuals. The superblock contains a concatenation of the metabolomic features and the microbial ASVs. The design of the model is as follows:

superblock <- cbind(asv_block, ft_block)
A <- list(asv_block, ft_block, treatment_block, group_block, superblock)
               #A,F,T,G,S
C <- matrix(c(0,1,1,1,1,#A
              1,0,1,1,1,#F
              1,1,0,1,1,#T
              1,1,1,0,1,#G
              1,1,1,1,0), 5, 5)

mod1_params <- rgcca(A, C, tau = 'optimal', ncomp = c(5,5,1,1,5), scheme = 'factorial', scale = TRUE, verbose = FALSE)

mod1 <- sgcca(A, C, c1 = c(mod1_params$tau[1,1], mod1_params$tau[1,2], 1, 1, mod1_params$tau[1,5]), 
              ncomp = c(5,5,1,1,5), scheme = 'factorial', scale = TRUE, verbose = FALSE)
df1 <- data.frame(Trt = multiset_meta$treatment,
                  Grp = multiset_meta$group,
                  Subject = multiset_meta$fecal_sample,
                  ASV1 = mod1$Y[[1]][,1],
                  ASV2 = mod1$Y[[1]][,2],
                  ASV3 = mod1$Y[[1]][,3],
                  FT1 = mod1$Y[[2]][,1],
                  FT2 = mod1$Y[[2]][,2],
                  FT3 = mod1$Y[[2]][,3],
                  SB1 = mod1$Y[[5]][,1],
                  SB2 = mod1$Y[[5]][,2],
                  SB3 = mod1$Y[[5]][,3],
                  SB4 = mod1$Y[[5]][,4],
                  SB5 = mod1$Y[[5]][,5])
ggplot(df1, aes(x = SB1, y = SB2, color = Trt)) +
  geom_point(aes(shape = Grp), size = 3) +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0) +
  stat_ellipse(aes(group = Trt, fill = Trt), geom = 'polygon', alpha = 0.1)  +
  theme(legend.position="bottom", legend.box = "horizontal",
        legend.title = element_blank()) +
  ggtitle('Model 1') +
  theme_minimal_grid() +
  xlab('Superblock Component 1') +
  ylab('Superblock Component 2')

With the exception of a few individuals, the SGCCA did a good job of separating individuals by both treatment and microbiome profile. We will now dig into the variables using a correlation circle to identify features related to each metadata item. Features with a positive loading on superblock component 1 are associated with C-Type individuals, while features with a negative loading are associated with E-Type individuals. Similarly, on component 2, negative loadings are associated with the negative control treatment group while positive loadings are associated with the combo treatment group. When plotted together, features which lie close to each other on the correlation circle are associated with one another. For now, I am not plotting the correlation circle but instead pulling out features with high loadings to examine.

mod1_ft <- mod1$a[[5]][mod1$a[[5]][,1] != 0 | mod1$a[[5]][,2] != 0, ] %>%
  as.data.frame() %>%
  dplyr::select(V1, V2) %>%
  rownames_to_column('feature') %>%
  rename('comp1' = V1, 'comp2' = V2) %>%
  mutate(type = ifelse(str_detect(feature, 'ASV'), 'ASV', 'FT'))

mod1_ft %>%
  dplyr::filter(type == 'ASV')
##   feature       comp1       comp2 type
## 1    ASV1 -0.07557776 0.000000000  ASV
## 2    ASV4  0.07452644 0.000000000  ASV
## 3  ASV354  0.00000000 0.003486842  ASV
## 4 ASV1584  0.00000000 0.053111178  ASV

In the ASV dimension, regularization left 4 features that were not shrunk to 0. This relatively small number of ASVs (5% of the total) is not hugely surprising, as our previous analysis of the microbiome showed that treatment had no effect on microbiome composition. ASV1 has a loading of -0.076 on component 1 and ASV4 has a loading of 0.075 on component 1. Unsurprisingly, ASV1 is from family Enterobacteriaceae and ASV4 is from family Clostridiaceae, validating that our analysis is behaving as expected. For component 2, ASV354 had a very weak positive loading of 0.0035 and ASV1584 had a loading of 0.053. ASV1584 is from family Eggerthellaceae of order Coriobacteriales. Previous work in humans has shown that cruciferous vegetable consumption is associated with an increase in bacteria from family Eggerthellaceae. Additionally, kale consumption increased the abundance of bacteria from order Coriobacteriales in a diet-induced obesity mouse model.
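The taxonomic assignments quoted above can be pulled directly from the 16S data. A hypothetical sketch, assuming the taxonomy lives in a phyloseq object (the name `ps` is illustrative, not an object defined in this analysis):

```r
# Hypothetical sketch: look up the taxonomy of the retained ASVs from a
# phyloseq object named `ps` (name is an assumption for illustration).
retained_asvs <- c('ASV1', 'ASV4', 'ASV354', 'ASV1584')
tax_table(ps)[rownames(tax_table(ps)) %in% retained_asvs, c('Order', 'Family', 'Genus')]
```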

For metabolomic features, 706 were retained by the model (11% of the total). Work to identify these features is ongoing; in the meantime, to help validate our findings we can plot the intensity of each feature across the treatment groups and the microbial profiles. We will start with the relatively strong loadings on component 1, which should show an association with a microbiome profile.

ftset <- mod1_ft %>% filter(type == 'FT')
cp1_top <- ftset %>%
  dplyr::arrange(abs(comp1)) %>%
  tail(n = 10) %>%
  mutate(signal = ifelse(comp1 > 0, 'C-Type', 'E-Type'))
         
cp2_top <- ftset %>%
  dplyr::arrange(abs(comp2)) %>%
  tail(n = 20) %>%
  mutate(signal = ifelse(comp2 > 0, 'combo', 'noveg'))

combined_tidy <-  combined_set %>%
  rename_with(~paste0('FT_', .x)) %>%
  rownames_to_column('sample') %>%
  pivot_longer(cols = starts_with('FT'), values_to = 'intensity', names_to = 'feature') %>%
  left_join(multiset_meta) 

ft_comp1 <-  combined_tidy %>%
  dplyr::filter(feature %in% cp1_top$feature[1:6]) %>%
  left_join(cp1_top)

ggplot(ft_comp1, aes(x = group, y = intensity, color = signal)) +
  geom_point() +
  stat_summary(fun = 'mean', geom = 'point', color = 'black', size = 4, shape = 18) +
  facet_wrap(~feature, ncol = 3) 

Overall, it looks like the features of interest are behaving as expected: the mean of each group (black diamonds) is higher for metabolites associated with one microbial profile or the other (as designated by color). Let’s perform the same kind of sanity check for features on component 2, which should be associated with the different treatment groups.

ft_comp2 <-  combined_tidy %>%
  dplyr::filter(feature %in% cp2_top$feature[c(18:20,2:4)]) %>%
  left_join(cp2_top)

ggplot(ft_comp2, aes(x = treatment, y = intensity, color = signal)) +
  geom_point() +
  stat_summary(fun = 'mean', geom = 'point', color = 'black', size = 4, shape = 18) +
  facet_wrap(~feature, ncol = 3) 

For these features, the signal appears stronger for the combo-associated features. For the no_veg-associated features, no_veg is only marginally higher, and combo is typically the lowest in abundance. Once these features are identified in Progenesis we will be better able to evaluate these findings.
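The "marginally higher" impression can be put to numbers by tabulating mean intensities per treatment group for the component 2 features. A small sketch built on the `combined_tidy` and `cp2_top` objects above:

```r
# Mean intensity of each component 2 feature within each treatment group,
# to back the visual impression from the faceted plot with numbers.
combined_tidy %>%
  dplyr::filter(feature %in% cp2_top$feature) %>%
  group_by(feature, treatment) %>%
  summarise(mean_intensity = mean(intensity), .groups = 'drop') %>%
  pivot_wider(names_from = treatment, values_from = mean_intensity)
```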

Question 2

Are these findings invalid because the metabolomics dataset was already primed for separation by only including features found to be significantly different between the two microbiome groups and between the treatment groups?

Model 2

Next we will run our SGCCA model again, but this time use it to attempt to discriminate between the treatment groups and the microbiome profiles. First we will run a model whose goal is to discriminate between microbial profiles. The model design is as follows:

A2 <- list(asv_block, ft_block, superblock, group_block)
              #A,F,S,G
C2 <- matrix(c(0,1,1,0,#A
               1,0,1,0,#F
               1,1,0,1,#S
               0,0,1,0), 4, 4)

mod2_params <- rgcca(A2, C2, tau = 'optimal', ncomp = c(5,5,5,1), scheme = 'factorial', scale = TRUE, verbose = TRUE)
## Computation of the RGCCA block components based on the factorial scheme 
## Optimal Shrinkage intensity paramaters are estimated 
## Computation of the RGCCA block components #1 is under progress...
##  Iter:    1  Fit: 1.41445304  Dif:  0.17533473 
##  ...
##  Iter:    7  Fit: 1.41529993  Dif:  0.00000001 
## The RGCCA algorithm converged to a stationary point after 6 iterations

## Computation of the RGCCA block components #2 is under progress...
##  Iter:    1  Fit: 0.14673432  Dif:  0.05385639 
##  ...
##  Iter:   27  Fit: 0.14891632  Dif:  0.00000001 
## The RGCCA algorithm converged to a stationary point after 26 iterations

## Computation of the RGCCA block components #3 is under progress...
##  Iter:    1  Fit: 0.11731866  Dif:  0.07207432 
##  ...
##  Iter:   13  Fit: 0.12101133  Dif:  0.00000000 
## The RGCCA algorithm converged to a stationary point after 12 iterations

## Computation of the RGCCA block components #4 is under progress...
##  Iter:    1  Fit: 0.06147526  Dif:  0.03612266 
##  ...
##  Iter:   21  Fit: 0.06588685  Dif:  0.00000001 
## The RGCCA algorithm converged to a stationary point after 20 iterations

## Computation of the RGCCA block components #5 is under progress ... 
##  Iter:    1  Fit: 0.04065321  Dif:  0.02709375 
##  ...
##  Iter:   22  Fit: 0.04423305  Dif:  0.00000001 
## The RGCCA algorithm converged to a stationary point after 21 iterations

mod2 <- sgcca(A2, C2, c1 = c(mod2_params$tau[1,1], mod2_params$tau[1,2], mod2_params$tau[1,3], 1), 
              ncomp = c(5,5,5,1), scheme = 'factorial', scale = TRUE, verbose = TRUE)
## Computation of the SGCCA block components based on the factorial scheme 
## Computation of the SGCCA block components #1 is under progress... 
##  Iter:    1  Fit:  0.03251707  Dif:  -0.29898278 
##  ...
##  Iter:   15  Fit:  0.05447346  Dif:  0.00000000 
## The SGCCA algorithm converged to a stationary point after 14 iterations

## Computation of the SGCCA block components #2 is under progress... 
##  Iter:    1  Fit:  0.00485478  Dif:  -0.30814404 
##  ...
##  Iter:   36  Fit:  0.01031330  Dif:  0.00000000 
## The SGCCA algorithm converged to a stationary point after 35 iterations

## Computation of the SGCCA block components #3 is under progress... 
##  Iter:    1  Fit:  0.00365880  Dif:  -0.26232569 
##  ...
##  Iter:   16  Fit:  0.00628916  Dif:  -0.00000000 
## The SGCCA algorithm converged to a stationary point after 15 iterations

## Computation of the SGCCA block components #4 is under progress... 
##  Iter:    1  Fit:  0.00290752  Dif:  -0.13573195 
##  ...
##  Iter:   43  Fit:  0.00341723  Dif:  -0.00000000 
## The SGCCA algorithm converged to a stationary point after 42 iterations

## Computation of the SGCCA block components #5 is under progress...
##  Iter:    1  Fit:  0.00183043  Dif:  -0.02475115 
##  ...
##  Iter:   57  Fit:  0.00234628  Dif:  -0.00000000 
## The SGCCA algorithm converged to a stationary point after 56 iterations

df2 <- data.frame(Trt = multiset_meta$treatment,
                  Grp = multiset_meta$group,
                  Subject = multiset_meta$fecal_sample,
                  ASV1 = mod2$Y[[1]][,1],
                  ASV2 = mod2$Y[[1]][,2],
                  ASV3 = mod2$Y[[1]][,3],
                  FT1 = mod2$Y[[2]][,1],
                  FT2 = mod2$Y[[2]][,2],
                  FT3 = mod2$Y[[2]][,3],
                  SB1 = mod2$Y[[3]][,1],
                  SB2 = mod2$Y[[3]][,2],
                  SB3 = mod2$Y[[3]][,3],
                  SB4 = mod2$Y[[3]][,4],
                  SB5 = mod2$Y[[3]][,5])

ggplot(df2, aes(x = SB1, y = SB2, color = Trt)) +
  geom_point(aes(shape = Grp), size = 3) +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0) +
  stat_ellipse(aes(group = Grp, fill = Grp), geom = 'polygon', alpha = 0.4)  +
  theme_minimal_grid() +
  # theme() must come after the complete theme, or its legend settings are overridden
  theme(legend.position = "bottom", legend.box = "horizontal",
        legend.title = element_blank()) +
  ggtitle('Model 2') +
  xlab('Superblock Component 1') +
  ylab('Superblock Component 2')

We see pretty clear clustering occurring along Component 1; let’s go through and pull out our highest-loaded features again.

mod2_ft <- mod2$a[[3]][mod2$a[[3]][,1] != 0, ] %>%
  as.data.frame() %>%
  dplyr::select(V1, V2) %>%
  rownames_to_column('feature') %>%
  rename('comp1' = V1, 'comp2' = V2) %>%
  mutate(type = ifelse(str_detect(feature, 'ASV'), 'ASV', 'FT'))

mod2_ft %>%
  dplyr::filter(type == 'ASV')
##   feature       comp1       comp2 type
## 1    ASV1 -0.07599740  0.00000000  ASV
## 2    ASV4  0.07529448 -0.05267909  ASV

For ASVs, only ASV1 and ASV4 were retained, and both behave as expected.

Of the 266 metabolomic features retained by the model, 230 were also found to be significantly different between the microbial profiles by our GLMM. This makes up 86% of the features from the SGCCA and 91% of the features from our GLMM. Let’s focus on the features not identified by the GLMM:
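The overlap figures above can be reproduced with a quick set intersection. A sketch, assuming (as in the filtering step below) that `glmm_sig_pos` and `glmm_sig_neg` hold the GLMM-significant feature IDs without the `FT_` prefix:

```r
# Overlap between SGCCA-retained metabolomic features and GLMM-significant features.
sgcca_ft <- mod2_ft$feature[mod2_ft$type == 'FT']
glmm_ft  <- paste0('FT_', c(glmm_sig_neg, glmm_sig_pos))
shared   <- intersect(sgcca_ft, glmm_ft)
length(shared) / length(sgcca_ft)  # fraction of SGCCA features confirmed by the GLMM
length(shared) / length(glmm_ft)   # fraction of GLMM features recovered by the SGCCA
```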

grp_ft <- mod2_ft[!mod2_ft$feature %in% c(paste0('FT_', glmm_sig_neg), paste0('FT_', glmm_sig_pos)),] %>%
  filter(type == 'FT') %>%
  arrange(abs(comp1)) %>%
  tail(n = 20) %>%
  mutate(signal = ifelse(comp1 > 0, 'C-Type', 'E-Type'))

ft_comp2 <-  combined_tidy %>%
  dplyr::filter(feature %in% grp_ft$feature[c(18:20,17,13,2)]) %>%
  left_join(grp_ft)

ggplot(ft_comp2, aes(x = group, y = intensity, color = signal)) +
  geom_point() +
  stat_summary(fun = 'mean', geom = 'point', color = 'black', size = 4, shape = 18) +
  facet_wrap(~feature, ncol = 3) 

Once again, the mean of each group is higher in the microbial group it corresponds to.

Model 3

Let’s now run the same type of model, but this time trying to discriminate by treatment. The model design is as follows:

A3 <- list(asv_block, ft_block, superblock, treatment_block)
              #A,F,S,G
C3 <- matrix(c(0,1,1,0,#A
               1,0,1,0,#F
               1,1,0,1,#S
               0,0,1,0), 4, 4)

mod3_params <- rgcca(A3, C3, tau = 'optimal', ncomp = c(5,5,5,1), scheme = 'factorial', scale = TRUE, verbose = TRUE)
## Computation of the RGCCA block components based on the factorial scheme 
## Optimal Shrinkage intensity paramaters are estimated 
## Computation of the RGCCA block components #1 is under progress...
##  Iter:    1  Fit: 2.03896774  Dif:  0.76664599 
##  ...
##  Iter:    6  Fit: 2.04192848  Dif:  0.00000001 
## The RGCCA algorithm converged to a stationary point after 5 iterations

## Computation of the RGCCA block components #2 is under progress...
##  Iter:    1  Fit: 0.15620159  Dif:  0.05777921 
##  ...
##  Iter:   78  Fit: 0.16256163  Dif:  0.00000001 
## The RGCCA algorithm converged to a stationary point after 77 iterations

## Computation of the RGCCA block components #3 is under progress...
##  Iter:    1  Fit: 0.13317592  Dif:  0.08384585 
##  ...
##  Iter:   21  Fit: 0.13943969  Dif:  0.00000001 
## The RGCCA algorithm converged to a stationary point after 20 iterations

## Computation of the RGCCA block components #4 is under progress...
##  Iter:    1  Fit: 0.06528250  Dif:  0.03990245 
##  ...
##  Iter:   68  Fit: 0.07964076  Dif:  0.00000001 
## The RGCCA algorithm converged to a stationary point after 67 iterations

## Computation of the RGCCA block components #5 is under progress ... 
##  Iter:    1  Fit: 0.05985479  Dif:  0.03891161 
##  ...
##  Iter:   22  Fit: 0.06277042  Dif:  0.00000001 
## The RGCCA algorithm converged to a stationary point after 21 iterations

mod3 <- sgcca(A3, C3, c1 = c(mod3_params$tau[1,1], mod3_params$tau[1,2], mod3_params$tau[1,3], 1), 
              ncomp = c(5,5,5,1), scheme = 'factorial', scale = TRUE, verbose = TRUE)
## Computation of the SGCCA block components based on the factorial scheme 
## Computation of the SGCCA block components #1 is under progress... 
##  Iter:    1  Fit:  0.01055749  Dif:  -0.32372982 
##  Iter:    2  Fit:  0.01276281  Dif:  0.00220533 
##  Iter:    3  Fit:  0.01320712  Dif:  0.00044431 
##  Iter:    4  Fit:  0.01325635  Dif:  0.00004922 
##  Iter:    5  Fit:  0.01326325  Dif:  0.00000691 
##  Iter:    6  Fit:  0.01326484  Dif:  0.00000158 
##  Iter:    7  Fit:  0.01326523  Dif:  0.00000039 
##  Iter:    8  Fit:  0.01326533  Dif:  0.00000010 
##  Iter:    9  Fit:  0.01326535  Dif:  0.00000002 
##  Iter:   10  Fit:  0.01326536  Dif:  0.00000001 
##  Iter:   11  Fit:  0.01326536  Dif:  0.00000000 
##  Iter:   12  Fit:  0.01326536  Dif:  0.00000000 
##  Iter:   13  Fit:  0.01326536  Dif:  0.00000000 
##  Iter:   14  Fit:  0.01326536  Dif:  0.00000000 
##  Iter:   15  Fit:  0.01326536  Dif:  0.00000000 
##  Iter:   16  Fit:  0.01326536  Dif:  0.00000000 
##  Iter:   17  Fit:  0.01326536  Dif:  -0.00000000 
##  Iter:   18  Fit:  0.01326536  Dif:  -0.00000000 
##  Iter:   19  Fit:  0.01326536  Dif:  0.00000000 
##  Iter:   20  Fit:  0.01326536  Dif:  0.00000000 
##  Iter:   21  Fit:  0.01326536  Dif:  -0.00000000 
##  Iter:   22  Fit:  0.01326536  Dif:  0.00000000 
##  Iter:   23  Fit:  0.01326536  Dif:  -0.00000000 
##  Iter:   24  Fit:  0.01326536  Dif:  0.00000000 
##  Iter:   25  Fit:  0.01326536  Dif:  0.00000000 
##  Iter:   26  Fit:  0.01326536  Dif:  -0.00000000 
##  Iter:   27  Fit:  0.01326536  Dif:  0.00000000 
## The SGCCA algorithm converged to a stationary point after 26 iterations

## Computation of the SGCCA block components #2 is under progress... 
##  Iter:    1  Fit:  0.00686580  Dif:  -0.07045048 
##  Iter:    2  Fit:  0.00835253  Dif:  0.00148673 
##  Iter:    3  Fit:  0.00861442  Dif:  0.00026189 
##  Iter:    4  Fit:  0.00957968  Dif:  0.00096526 
##  Iter:    5  Fit:  0.01065288  Dif:  0.00107320 
##  Iter:    6  Fit:  0.01075350  Dif:  0.00010062 
##  Iter:    7  Fit:  0.01076889  Dif:  0.00001539 
##  Iter:    8  Fit:  0.01077198  Dif:  0.00000309 
##  Iter:    9  Fit:  0.01077256  Dif:  0.00000057 
##  Iter:   10  Fit:  0.01077266  Dif:  0.00000010 
##  Iter:   11  Fit:  0.01077267  Dif:  0.00000002 
##  Iter:   12  Fit:  0.01077268  Dif:  0.00000000 
##  Iter:   13  Fit:  0.01077268  Dif:  0.00000000 
##  Iter:   14  Fit:  0.01077268  Dif:  0.00000000 
##  Iter:   15  Fit:  0.01077268  Dif:  0.00000000 
##  Iter:   16  Fit:  0.01077268  Dif:  0.00000000 
##  Iter:   17  Fit:  0.01077268  Dif:  0.00000000 
##  Iter:   18  Fit:  0.01077268  Dif:  -0.00000000 
##  Iter:   19  Fit:  0.01077268  Dif:  0.00000000 
##  Iter:   20  Fit:  0.01077268  Dif:  -0.00000000 
##  Iter:   21  Fit:  0.01077268  Dif:  0.00000000 
##  Iter:   22  Fit:  0.01077268  Dif:  -0.00000000 
##  Iter:   23  Fit:  0.01077268  Dif:  0.00000000 
##  Iter:   24  Fit:  0.01077268  Dif:  -0.00000000 
##  Iter:   25  Fit:  0.01077268  Dif:  0.00000000 
## The SGCCA algorithm converged to a stationary point after 24 iterations

## Computation of the SGCCA block components #3 is under progress... 
##  Iter:    1  Fit:  0.00326439  Dif:  -0.03436047 
##  Iter:    2  Fit:  0.00361068  Dif:  0.00034629 
##  Iter:    3  Fit:  0.00363784  Dif:  0.00002716 
##  Iter:    4  Fit:  0.00364058  Dif:  0.00000274 
##  Iter:    5  Fit:  0.00364098  Dif:  0.00000040 
##  Iter:    6  Fit:  0.00364105  Dif:  0.00000007 
##  Iter:    7  Fit:  0.00364106  Dif:  0.00000001 
##  Iter:    8  Fit:  0.00364106  Dif:  0.00000000 
##  Iter:    9  Fit:  0.00364106  Dif:  0.00000000 
##  Iter:   10  Fit:  0.00364106  Dif:  0.00000000 
##  Iter:   11  Fit:  0.00364106  Dif:  0.00000000 
##  Iter:   12  Fit:  0.00364106  Dif:  0.00000000 
##  Iter:   13  Fit:  0.00364106  Dif:  0.00000000 
##  Iter:   14  Fit:  0.00364106  Dif:  0.00000000 
##  Iter:   15  Fit:  0.00364106  Dif:  -0.00000000 
##  Iter:   16  Fit:  0.00364106  Dif:  0.00000000 
##  Iter:   17  Fit:  0.00364106  Dif:  0.00000000 
##  Iter:   18  Fit:  0.00364106  Dif:  0.00000000 
##  Iter:   19  Fit:  0.00364106  Dif:  -0.00000000 
##  Iter:   20  Fit:  0.00364106  Dif:  0.00000000 
##  Iter:   21  Fit:  0.00364106  Dif:  0.00000000 
##  Iter:   22  Fit:  0.00364106  Dif:  -0.00000000 
##  Iter:   23  Fit:  0.00364106  Dif:  -0.00000000 
## The SGCCA algorithm converged to a stationary point after 22 iterations

## Computation of the SGCCA block components #4 is under progress... 
##  Iter:    1  Fit:  0.00295453  Dif:  -0.00288861 
##  Iter:    2  Fit:  0.00409040  Dif:  0.00113587 
##  Iter:    3  Fit:  0.00482607  Dif:  0.00073566 
##  Iter:    4  Fit:  0.00526161  Dif:  0.00043554 
##  Iter:    5  Fit:  0.00559794  Dif:  0.00033633 
##  Iter:    6  Fit:  0.00568200  Dif:  0.00008406 
##  Iter:    7  Fit:  0.00568408  Dif:  0.00000208 
##  Iter:    8  Fit:  0.00568411  Dif:  0.00000003 
##  Iter:    9  Fit:  0.00568411  Dif:  0.00000000 
##  Iter:   10  Fit:  0.00568411  Dif:  0.00000000 
##  Iter:   11  Fit:  0.00568411  Dif:  0.00000000 
##  Iter:   12  Fit:  0.00568411  Dif:  0.00000000 
##  Iter:   13  Fit:  0.00568411  Dif:  -0.00000000 
##  Iter:   14  Fit:  0.00568411  Dif:  -0.00000000 
##  Iter:   15  Fit:  0.00568411  Dif:  0.00000000 
##  Iter:   16  Fit:  0.00568411  Dif:  0.00000000 
##  Iter:   17  Fit:  0.00568411  Dif:  -0.00000000 
##  Iter:   18  Fit:  0.00568411  Dif:  0.00000000 
## The SGCCA algorithm converged to a stationary point after 17 iterations

## Computation of the SGCCA block components #5 is under progress...
##  Iter:    1  Fit:  0.00237991  Dif:  -0.00199228 
##  Iter:    2  Fit:  0.00284959  Dif:  0.00046968 
##  Iter:    3  Fit:  0.00290537  Dif:  0.00005578 
##  Iter:    4  Fit:  0.00291659  Dif:  0.00001122 
##  Iter:    5  Fit:  0.00291899  Dif:  0.00000240 
##  Iter:    6  Fit:  0.00291969  Dif:  0.00000070 
##  Iter:    7  Fit:  0.00291993  Dif:  0.00000024 
##  Iter:    8  Fit:  0.00292002  Dif:  0.00000009 
##  Iter:    9  Fit:  0.00292005  Dif:  0.00000004 
##  Iter:   10  Fit:  0.00292007  Dif:  0.00000002 
##  Iter:   11  Fit:  0.00292008  Dif:  0.00000001 
##  Iter:   12  Fit:  0.00292008  Dif:  0.00000000 
##  Iter:   13  Fit:  0.00292008  Dif:  0.00000000 
##  Iter:   14  Fit:  0.00292008  Dif:  0.00000000 
##  Iter:   15  Fit:  0.00292008  Dif:  0.00000000 
##  Iter:   16  Fit:  0.00292008  Dif:  0.00000000 
##  Iter:   17  Fit:  0.00292008  Dif:  0.00000000 
##  Iter:   18  Fit:  0.00292008  Dif:  0.00000000 
##  Iter:   19  Fit:  0.00292008  Dif:  0.00000000 
##  Iter:   20  Fit:  0.00292008  Dif:  0.00000000 
##  Iter:   21  Fit:  0.00292008  Dif:  0.00000000 
##  Iter:   22  Fit:  0.00292008  Dif:  -0.00000000 
##  Iter:   23  Fit:  0.00292008  Dif:  0.00000000 
##  Iter:   24  Fit:  0.00292008  Dif:  0.00000000 
##  Iter:   25  Fit:  0.00292008  Dif:  -0.00000000 
##  Iter:   26  Fit:  0.00292008  Dif:  -0.00000000 
##  Iter:   27  Fit:  0.00292008  Dif:  0.00000000 
##  Iter:   28  Fit:  0.00292008  Dif:  -0.00000000 
##  Iter:   29  Fit:  0.00292008  Dif:  0.00000000 
##  Iter:   30  Fit:  0.00292008  Dif:  -0.00000000 
##  Iter:   31  Fit:  0.00292008  Dif:  0.00000000 
##  Iter:   32  Fit:  0.00292008  Dif:  -0.00000000 
##  Iter:   33  Fit:  0.00292008  Dif:  -0.00000000 
##  Iter:   34  Fit:  0.00292008  Dif:  0.00000000 
##  Iter:   35  Fit:  0.00292008  Dif:  -0.00000000 
##  Iter:   36  Fit:  0.00292008  Dif:  -0.00000000 
##  Iter:   37  Fit:  0.00292008  Dif:  0.00000000 
##  Iter:   38  Fit:  0.00292008  Dif:  0.00000000 
##  Iter:   39  Fit:  0.00292008  Dif:  0.00000000 
## The SGCCA algorithm converged to a stationary point after 38 iterations

df3 <- data.frame(Trt = multiset_meta$treatment,
                  Grp = multiset_meta$group,
                  Subject = multiset_meta$fecal_sample,
                  ASV1 = mod3$Y[[1]][,1],
                  ASV2 = mod3$Y[[1]][,2],
                  ASV3 = mod3$Y[[1]][,3],
                  FT1 = mod3$Y[[2]][,1],
                  FT2 = mod3$Y[[2]][,2],
                  FT3 = mod3$Y[[2]][,3],
                  SB1 = mod3$Y[[3]][,1],
                  SB2 = mod3$Y[[3]][,2],
                  SB3 = mod3$Y[[3]][,3],
                  SB4 = mod3$Y[[3]][,4],
                  SB5 = mod3$Y[[3]][,5])

ggplot(df3, aes(x = SB1, y = SB4, color = Trt)) +
  geom_point(aes(shape = Grp), size = 3) +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0) +
  stat_ellipse(aes(group = Trt, fill = Trt), geom = 'polygon', alpha = 0.4)  +
  theme(legend.position="bottom", legend.box = "horizontal",
        legend.title = element_blank()) +
  ggtitle('Model 3') +
  theme_minimal_grid() +
  xlab('Superblock Component 1') +
  ylab('Superblock Component 4')

For model 3, the 4th component shows good separation among all 4 groups, notably between broc and brus, which we had not achieved before. As before, we will examine the high-loading features.

mod3_ft <- mod3$a[[3]][mod3$a[[3]][,1] != 0 | mod3$a[[3]][,4] != 0, ] %>%
  as.data.frame() %>%
  dplyr::select(V1, V4) %>%
  rownames_to_column('feature') %>%
  rename('comp1' = V1, 'comp4' = V4) %>%
  mutate(type = ifelse(str_detect(feature, 'ASV'), 'ASV', 'FT'))

mod3_ft %>%
  dplyr::filter(type == 'ASV')
##   feature      comp1        comp4 type
## 1  ASV291 0.00000000  0.043632520  ASV
## 2  ASV415 0.00000000  0.001345626  ASV
## 3 ASV1487 0.00000000 -0.039269368  ASV
## 4 ASV1584 0.02146219  0.000000000  ASV
## 5 ASV1609 0.00000000 -0.008441891  ASV
## 6 ASV1736 0.00000000 -0.030990774  ASV
## 7 ASV2581 0.00000000 -0.017354076  ASV

We see 7 ASVs coming out of the model this time. On component 1, only ASV1584 was not shrunk to 0, and it is associated with the combo treatment, which is expected. ASV291, which corresponds to family Lachnospiraceae, was found here to be associated with broccoli; studies in rats have shown that this family decreases with cruciferous vegetable consumption, while studies in mice show that it increases. ASV1736, which corresponds to genus Oscillibacter and has been shown to increase with cruciferous vegetable consumption in rats, was found here to be associated with brussels sprouts. While a more comprehensive literature search is needed, the fact that known and previously observed ASVs are appearing seems promising.
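To make these taxonomic assignments reproducible, the family/genus labels could be pulled straight from the taxonomy table rather than looked up by hand. A sketch, assuming a phyloseq object (here the hypothetical name `ps`) whose taxa names match the ASV IDs and whose rank names include `Family` and `Genus`:

```r
# `ps` is a hypothetical phyloseq object holding this study's 16S data;
# adjust the object name and rank labels to match your workspace.
library(phyloseq)

selected_asvs <- c('ASV291', 'ASV415', 'ASV1487', 'ASV1584',
                   'ASV1609', 'ASV1736', 'ASV2581')

# Family/Genus assignments for the ASVs retained by the model
tax_table(ps)[selected_asvs, c('Family', 'Genus')]
```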

Switching gears to the metabolomic features, we see 925 features coming back with non-zero loadings.
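For reference, that count can be recovered directly from the sparse loading matrix used above; here we count non-ASV (FT) rows of `mod3$a[[3]]` with a non-zero loading on component 1 or component 4:

```r
# Count metabolomic (FT) features surviving the sparsity penalty on
# component 1 or 4 of the superblock loadings
ft_rows <- !str_detect(rownames(mod3$a[[3]]), 'ASV')
sum(ft_rows & (mod3$a[[3]][, 1] != 0 | mod3$a[[3]][, 4] != 0))
```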

trt_ft_nc <- mod3_ft %>%
  filter(type == 'FT') %>%
  arrange(abs(comp1)) %>%
  tail(n = 30) %>%
  mutate(signal = ifelse(comp1 > 0, 'combo', 'noveg'))

nc_comp <- combined_tidy %>%
  dplyr::filter(feature %in% trt_ft_nc$feature[c(30:28, 23, 18, 10)]) %>%
  left_join(trt_ft_nc)

ggplot(nc_comp, aes(x = treatment, y = intensity, color = signal)) +
  geom_point() +
  stat_summary(fun = 'mean', geom = 'point', color = 'black', size = 4, shape = 18) +
  facet_wrap(~feature, ncol = 3) +
  ggtitle('Component 1 Features of Interest')

trt_ft_bb <- mod3_ft %>%
  filter(type == 'FT') %>%
  arrange(abs(comp4)) %>%
  tail(n = 7) %>%
  mutate(signal = ifelse(comp4 > 0, 'broc', 'brus'))

bb_comp <- combined_tidy %>%
  dplyr::filter(feature %in% trt_ft_bb$feature[c(1,3:7)]) %>%
  left_join(trt_ft_bb)

ggplot(bb_comp, aes(x = treatment, y = intensity, color = signal)) +
  geom_point() +
  stat_summary(fun = 'mean', geom = 'point', color = 'black', size = 4, shape = 18) +
  facet_wrap(~feature, ncol = 3) +
  ggtitle('Component 4 Features of Interest')

trt_ft <- mod3_ft %>%
  filter(type == 'FT')

For component 1 (no-veg vs combo), we see the same trend as before: features associated with no-veg typically have a slightly higher mean intensity in no-veg and the lowest overall mean intensity in combo. For component 4 (brus vs broc), there is a clear difference between features associated with broc and those associated with brus. Overall, 900 of the 925 metabolomic features from this model (97%) were identified as significantly different in the RMANOVA, accounting for 15% of the total features found in the RMANOVA.
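Those overlap percentages can be computed from the `trt_ft` table above, given the RMANOVA hit list. A sketch, where `rmanova_sig` is a hypothetical character vector of feature names flagged as significant by the RMANOVA:

```r
# `rmanova_sig` is a hypothetical vector of RMANOVA-significant feature names
overlap <- intersect(trt_ft$feature, rmanova_sig)

length(overlap) / nrow(trt_ft)        # share of model features also in RMANOVA
length(overlap) / length(rmanova_sig) # share of RMANOVA features recovered here
```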

Question 3

Is the high overlap between the features pulled from each model and those from the GLMM/RMANOVA a red flag? Are my models biased and failing to select meaningful features?

I plan on conducting LOOCV on these models to evaluate feature importance as a further validation step.
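One way to frame that LOOCV is as a selection-stability check: hold out each sample in turn, refit the sparse model, and count how often each feature survives the penalty. A sketch under the assumption that the block list `A3`, design matrix `C3`, and tuning object `mod3_params` from the model fits above are in the workspace:

```r
# Leave-one-out selection stability for the SGCCA model (sketch; assumes
# A3, C3, and mod3_params from the fits above)
n <- nrow(A3[[1]])
sel_count <- setNames(numeric(nrow(mod3$a[[3]])), rownames(mod3$a[[3]]))

for (i in seq_len(n)) {
  A_cv <- lapply(A3, function(block) block[-i, , drop = FALSE])
  fit  <- sgcca(A_cv, C3,
                c1 = c(mod3_params$tau[1,1], mod3_params$tau[1,2],
                       mod3_params$tau[1,3], 1),
                ncomp = c(5, 5, 5, 1), scheme = 'factorial',
                scale = TRUE, verbose = FALSE)
  kept <- rownames(fit$a[[3]])[fit$a[[3]][, 1] != 0 | fit$a[[3]][, 4] != 0]
  sel_count[kept] <- sel_count[kept] + 1
}

# Features selected in (nearly) all n refits are the most stable
head(sort(sel_count, decreasing = TRUE), 10)
```

Features whose selection frequency is near n would be the ones worth carrying forward to identification; features selected only sporadically may be artifacts of particular samples.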

Once metabolite identification is done I also plan on searching the literature to find links between metabolites and microbes which are both associated with specific treatment groups, specifically if they have similar loadings to one another.

Regression Analysis

As a final analysis, I plan on using remMap to conduct a regularized regression analysis on this dataset. remMap uses L1 and L2 penalization to generate a sparse model, yielding betas for predictors that correspond to specific response variables. In this model, the metabolomic features will be used as the predictors and the ASVs as the response variables (remMap requires p > q).

#Run on the shell since computationally intensive
#lamL1.v <- seq(1,11, by=1)
#lamL2.v <- seq(11,21, by=1)
#
#try3 <- remMap.CV(X=X_mat, Y=Y_mat,lamL1.v, lamL2.v, C.m=NULL, fold=5, seed=1)
#pick <- which.min(as.vector(try3$ols.cv))
#lamL1.pick <- try3$l.index[1,pick] ## find the optimal (LamL1,LamL2) based on the cv score
#lamL2.pick <- try3$l.index[2,pick]
#
#result <- remMap(X_mat, Y_mat,lamL1=lamL1.pick, lamL2=lamL2.pick, phi0=NULL, C.m=NULL)

result <- readRDS('./remMap/remMapRes.RDS')

final <- result$phi
rownames(final) <- colnames(ft_block)
colnames(final) <- colnames(asv_block)

final[rowSums(final != 0) > 0, ] #rowSums(final != 0) avoids mixed-sign betas cancelling in a plain row sum
##                              ASV1 ASV2         ASV4 ASV49 ASV68 ASV71 ASV72
## FT_4.23_907.2967_neg -0.012827447    0  0.008799967     0     0     0     0
## FT_4.57_269.9680_pos -0.017752231    0  0.013439985     0     0     0     0
## FT_7.28_570.9858_pos  0.007100841    0 -0.004511694     0     0     0     0
##                      ASV76 ASV83 ASV85 ASV89 ASV91 ASV93 ASV96 ASV130 ASV132
## FT_4.23_907.2967_neg     0     0     0     0     0     0     0      0      0
## FT_4.57_269.9680_pos     0     0     0     0     0     0     0      0      0
## FT_7.28_570.9858_pos     0     0     0     0     0     0     0      0      0
##                      ASV134 ASV141 ASV145 ASV148 ASV151 ASV155 ASV157 ASV168
## FT_4.23_907.2967_neg      0      0      0      0      0      0      0      0
## FT_4.57_269.9680_pos      0      0      0      0      0      0      0      0
## FT_7.28_570.9858_pos      0      0      0      0      0      0      0      0
##                      ASV172 ASV229 ASV241 ASV256 ASV277 ASV286 ASV291 ASV310
## FT_4.23_907.2967_neg      0      0      0      0      0      0      0      0
## FT_4.57_269.9680_pos      0      0      0      0      0      0      0      0
## FT_7.28_570.9858_pos      0      0      0      0      0      0      0      0
##                      ASV354 ASV398 ASV415 ASV418 ASV427 ASV439 ASV457 ASV461
## FT_4.23_907.2967_neg      0      0      0      0      0      0      0      0
## FT_4.57_269.9680_pos      0      0      0      0      0      0      0      0
## FT_7.28_570.9858_pos      0      0      0      0      0      0      0      0
##                      ASV462 ASV482 ASV487 ASV495 ASV501 ASV514 ASV556 ASV573
## FT_4.23_907.2967_neg      0      0      0      0      0      0      0      0
## FT_4.57_269.9680_pos      0      0      0      0      0      0      0      0
## FT_7.28_570.9858_pos      0      0      0      0      0      0      0      0
##                      ASV584 ASV625 ASV683 ASV805 ASV820 ASV872 ASV875 ASV892
## FT_4.23_907.2967_neg      0      0      0      0      0      0      0      0
## FT_4.57_269.9680_pos      0      0      0      0      0      0      0      0
## FT_7.28_570.9858_pos      0      0      0      0      0      0      0      0
##                      ASV914 ASV927 ASV934 ASV993 ASV1040 ASV1104 ASV1122
## FT_4.23_907.2967_neg      0      0      0      0       0       0       0
## FT_4.57_269.9680_pos      0      0      0      0       0       0       0
## FT_7.28_570.9858_pos      0      0      0      0       0       0       0
##                      ASV1167 ASV1263 ASV1303 ASV1487 ASV1584 ASV1609 ASV1736
## FT_4.23_907.2967_neg       0       0       0       0       0       0       0
## FT_4.57_269.9680_pos       0       0       0       0       0       0       0
## FT_7.28_570.9858_pos       0       0       0       0       0       0       0
##                      ASV1983 ASV2581
## FT_4.23_907.2967_neg       0       0
## FT_4.57_269.9680_pos       0       0
## FT_7.28_570.9858_pos       0       0
rmft <- rownames(final)[rowSums(final != 0) > 0]

spearman %>%
  filter(padj <= 0.05) %>%
  filter(feature %in% rmft) %>%
  filter(ASV %in% c('ASV1', 'ASV4'))
## # A tibble: 6 × 7
##   feature              ASV   data               spm        rho     pval     padj
##   <chr>                <chr> <list>             <list>   <dbl>    <dbl>    <dbl>
## 1 FT_4.23_907.2967_neg ASV1  <tibble [40 × 10]> <htest> -0.828 4.30e-11  8.99e-7
## 2 FT_4.23_907.2967_neg ASV4  <tibble [40 × 10]> <htest>  0.647 6.32e- 6  1.94e-3
## 3 FT_4.57_269.9680_pos ASV1  <tibble [40 × 10]> <htest> -0.779 3.28e- 9  8.40e-6
## 4 FT_4.57_269.9680_pos ASV4  <tibble [40 × 10]> <htest>  0.698 5.47e- 7  3.40e-4
## 5 FT_7.28_570.9858_pos ASV1  <tibble [40 × 10]> <htest>  0.753 2.13e- 8  3.01e-5
## 6 FT_7.28_570.9858_pos ASV4  <tibble [40 × 10]> <htest> -0.552 2.22e- 4  1.96e-2

RemMap gave us a highly sparse model, with only 3 features not shrunk to 0, and all of these features had fairly small betas. Unsurprisingly, the only ASVs remMap found to be influenced were ASV1 and ASV4, which represent families Enterobacteriaceae and Clostridiaceae, respectively. Consistent with this, the Spearman analysis found that all three features had significant correlations with these two ASVs, with directions matching the signs of the betas.
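That sign agreement can be checked programmatically rather than by eye, joining the remMap betas to the Spearman table. A sketch using `final`, `rmft`, and `spearman` from above:

```r
# Does the sign of each remMap beta match the sign of the Spearman rho?
beta_long <- final[rmft, c('ASV1', 'ASV4')] %>%
  as.data.frame() %>%
  rownames_to_column('feature') %>%
  pivot_longer(-feature, names_to = 'ASV', values_to = 'beta')

spearman %>%
  dplyr::filter(feature %in% rmft, ASV %in% c('ASV1', 'ASV4')) %>%
  dplyr::select(feature, ASV, rho) %>%
  left_join(beta_long, by = c('feature', 'ASV')) %>%
  mutate(sign_match = sign(rho) == sign(beta))
```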

RemMap has two strategies for choosing the shrinkage parameters: cross-validation scores based on the shrunken estimator (RSS) or on the unshrunken estimator (OLS). OLS is recommended since RSS tends to yield larger models, but we will try RSS next for comparison.

#Run on the shell since computationally intensive
#lamL1.v <- seq(1,11, by=1)
#lamL2.v <- seq(11,21, by=1)
#
#try3 <- remMap.CV(X=X_mat, Y=Y_mat,lamL1.v, lamL2.v, C.m=NULL, fold=5, seed=1)
#pick <- which.min(as.vector(try3$rss.cv))
#lamL1.pick <- try3$l.index[1,pick] ## find the optimal (LamL1,LamL2) based on the cv score
#lamL2.pick <- try3$l.index[2,pick]
#
#result <- remMap(X_mat, Y_mat,lamL1=lamL1.pick, lamL2=lamL2.pick, phi0=NULL, C.m=NULL)

remMapRSS <- readRDS('./remMap/remMapRes_rss.RDS')
final_rss <- remMapRSS$phi

rownames(final_rss) <- colnames(ft_block)
colnames(final_rss) <- colnames(asv_block)

final_rss[rowSums(final_rss != 0) > 0, ]
##                                ASV1 ASV2          ASV4 ASV49 ASV68 ASV71 ASV72
## FT_4.23_907.2967_neg  -2.126679e-02    0  0.0152492388     0     0     0     0
## FT_5.37_161.0060_neg  -5.484552e-04    0  0.0005679717     0     0     0     0
## FT_22.87_166.0858_pos  4.064068e-07    0  0.0000000000     0     0     0     0
## FT_4.57_269.9680_pos  -2.022381e-02    0  0.0159197539     0     0     0     0
## FT_7.28_570.9858_pos   3.206878e-03    0 -0.0021141747     0     0     0     0
## FT_4.20_675.3031_pos  -1.780895e-03    0  0.0013044583     0     0     0     0
##                       ASV76 ASV83 ASV85 ASV89 ASV91 ASV93 ASV96 ASV130 ASV132
## FT_4.23_907.2967_neg      0     0     0     0     0     0     0      0      0
## FT_5.37_161.0060_neg      0     0     0     0     0     0     0      0      0
## FT_22.87_166.0858_pos     0     0     0     0     0     0     0      0      0
## FT_4.57_269.9680_pos      0     0     0     0     0     0     0      0      0
## FT_7.28_570.9858_pos      0     0     0     0     0     0     0      0      0
## FT_4.20_675.3031_pos      0     0     0     0     0     0     0      0      0
##                       ASV134 ASV141 ASV145 ASV148 ASV151 ASV155 ASV157 ASV168
## FT_4.23_907.2967_neg       0      0      0      0      0      0      0      0
## FT_5.37_161.0060_neg       0      0      0      0      0      0      0      0
## FT_22.87_166.0858_pos      0      0      0      0      0      0      0      0
## FT_4.57_269.9680_pos       0      0      0      0      0      0      0      0
## FT_7.28_570.9858_pos       0      0      0      0      0      0      0      0
## FT_4.20_675.3031_pos       0      0      0      0      0      0      0      0
##                       ASV172 ASV229 ASV241 ASV256 ASV277 ASV286 ASV291 ASV310
## FT_4.23_907.2967_neg       0      0      0      0      0      0      0      0
## FT_5.37_161.0060_neg       0      0      0      0      0      0      0      0
## FT_22.87_166.0858_pos      0      0      0      0      0      0      0      0
## FT_4.57_269.9680_pos       0      0      0      0      0      0      0      0
## FT_7.28_570.9858_pos       0      0      0      0      0      0      0      0
## FT_4.20_675.3031_pos       0      0      0      0      0      0      0      0
##                       ASV354 ASV398 ASV415 ASV418 ASV427 ASV439 ASV457 ASV461
## FT_4.23_907.2967_neg       0      0      0      0      0      0      0      0
## FT_5.37_161.0060_neg       0      0      0      0      0      0      0      0
## FT_22.87_166.0858_pos      0      0      0      0      0      0      0      0
## FT_4.57_269.9680_pos       0      0      0      0      0      0      0      0
## FT_7.28_570.9858_pos       0      0      0      0      0      0      0      0
## FT_4.20_675.3031_pos       0      0      0      0      0      0      0      0
##                       ASV462 ASV482 ASV487 ASV495 ASV501 ASV514 ASV556 ASV573
## FT_4.23_907.2967_neg       0      0      0      0      0      0      0      0
## FT_5.37_161.0060_neg       0      0      0      0      0      0      0      0
## FT_22.87_166.0858_pos      0      0      0      0      0      0      0      0
## FT_4.57_269.9680_pos       0      0      0      0      0      0      0      0
## FT_7.28_570.9858_pos       0      0      0      0      0      0      0      0
## FT_4.20_675.3031_pos       0      0      0      0      0      0      0      0
##                       ASV584 ASV625 ASV683 ASV805 ASV820 ASV872 ASV875 ASV892
## FT_4.23_907.2967_neg       0      0      0      0      0      0      0      0
## FT_5.37_161.0060_neg       0      0      0      0      0      0      0      0
## FT_22.87_166.0858_pos      0      0      0      0      0      0      0      0
## FT_4.57_269.9680_pos       0      0      0      0      0      0      0      0
## FT_7.28_570.9858_pos       0      0      0      0      0      0      0      0
## FT_4.20_675.3031_pos       0      0      0      0      0      0      0      0
##                       ASV914 ASV927 ASV934 ASV993 ASV1040 ASV1104 ASV1122
## FT_4.23_907.2967_neg       0      0      0      0       0       0       0
## FT_5.37_161.0060_neg       0      0      0      0       0       0       0
## FT_22.87_166.0858_pos      0      0      0      0       0       0       0
## FT_4.57_269.9680_pos       0      0      0      0       0       0       0
## FT_7.28_570.9858_pos       0      0      0      0       0       0       0
## FT_4.20_675.3031_pos       0      0      0      0       0       0       0
##                       ASV1167 ASV1263 ASV1303 ASV1487 ASV1584 ASV1609 ASV1736
## FT_4.23_907.2967_neg        0       0       0       0       0       0       0
## FT_5.37_161.0060_neg        0       0       0       0       0       0       0
## FT_22.87_166.0858_pos       0       0       0       0       0       0       0
## FT_4.57_269.9680_pos        0       0       0       0       0       0       0
## FT_7.28_570.9858_pos        0       0       0       0       0       0       0
## FT_4.20_675.3031_pos        0       0       0       0       0       0       0
##                       ASV1983 ASV2581
## FT_4.23_907.2967_neg        0       0
## FT_5.37_161.0060_neg        0       0
## FT_22.87_166.0858_pos       0       0
## FT_4.57_269.9680_pos        0       0
## FT_7.28_570.9858_pos        0       0
## FT_4.20_675.3031_pos        0       0
rmft_rss <- rownames(final_rss)[rowSums(final_rss != 0) > 0]

spearman %>%
  filter(padj <= 0.05) %>%
  filter(feature %in% rmft_rss) %>%
  filter(ASV %in% c('ASV1', 'ASV4'))
## # A tibble: 10 × 7
##    feature              ASV   data               spm        rho     pval    padj
##    <chr>                <chr> <list>             <list>   <dbl>    <dbl>   <dbl>
##  1 FT_4.23_907.2967_neg ASV1  <tibble [40 × 10]> <htest> -0.828 4.30e-11 8.99e-7
##  2 FT_4.23_907.2967_neg ASV4  <tibble [40 × 10]> <htest>  0.647 6.32e- 6 1.94e-3
##  3 FT_5.37_161.0060_neg ASV1  <tibble [40 × 10]> <htest> -0.710 2.94e- 7 2.20e-4
##  4 FT_5.37_161.0060_neg ASV4  <tibble [40 × 10]> <htest>  0.715 2.24e- 7 1.80e-4
##  5 FT_4.57_269.9680_pos ASV1  <tibble [40 × 10]> <htest> -0.779 3.28e- 9 8.40e-6
##  6 FT_4.57_269.9680_pos ASV4  <tibble [40 × 10]> <htest>  0.698 5.47e- 7 3.40e-4
##  7 FT_7.28_570.9858_pos ASV1  <tibble [40 × 10]> <htest>  0.753 2.13e- 8 3.01e-5
##  8 FT_7.28_570.9858_pos ASV4  <tibble [40 × 10]> <htest> -0.552 2.22e- 4 1.96e-2
##  9 FT_4.20_675.3031_pos ASV1  <tibble [40 × 10]> <htest> -0.785 1.99e- 9 5.70e-6
## 10 FT_4.20_675.3031_pos ASV4  <tibble [40 × 10]> <htest>  0.655 4.57e- 6 1.57e-3

Out of this model, we see 3 new features being selected compared to the first iteration. Once again, all of the selected features have significant Spearman correlations, except for FT_22.87_166.0858_pos, which also has the smallest beta of the selected features.

Question 4

Would it be beneficial to run other regression analyses, such as elastic net? Since remMap cannot handle mixed models, which we have here, would a package like lmmen or glmmLasso, which fit penalized mixed models, be more useful than remMap?

Next Steps

Feature identification for the metabolomics data is currently ongoing, and we are focusing on the features output from the GLMM and RMANOVA models. Once these are completed, I plan to begin drafting a manuscript from these data, with the goal of finishing the first draft by the end of winter term.

General Questions

  1. Do the methods and interpretation I’ve used seem correct/applicable so far? Any major red flags that are sticking out?
  2. Are there any other analyses or tools that you are currently working on that you think would be useful in this analysis?
  3. Any suggestions on further incorporating in the spearman analysis? My thoughts were to go back and see how/if the ASVs/FTs which were associated with similar treatment groups or microbial profiles had significant spearman correlations? My other thought would be 2 run the spearman analysis 2 more times, once stratified by treatment and once by microbial profile to see if it changes the results and how that aligns with other results of the paper.